Abstract

Minwise hashing is the standard technique
in the context of search and databases for effi- 
ciently estimating set (e.g., high-dimensional 
0/1 vector) similarities. Recently, b-bit min- 
wise hashing was proposed which signifi- 
cantly improves upon the original minwise 
hashing in practice by storing only the low- 
est b bits of each hashed value, as opposed to 
using 64 bits. b-Bit hashing is particularly
effective in applications which mainly con- 
cern sets of high similarities (e.g., the resem- 
blance > 0.5). However, there are other im- 
portant applications in which not just pairs of 
high similarities matter. For example, many 
learning algorithms require all pairwise sim- 
ilarities and it is expected that only a small 
fraction of the pairs are similar. Further- 
more, many applications care more about 
containment (e.g., how much one object is 
contained by another object) than the resem- 
blance. In this paper, we show that the es- 
timators for minwise hashing and b-bit min- 
wise hashing used in the current practice can 
be systematically improved and the improve- 
ments are most significant for set pairs of low 
resemblance and high containment. 

1 Introduction 

Computing the size of set intersections is a fundamen- 
tal problem in information retrieval, databases, and 
machine learning. For example, binary document vectors
represented using w-shingles can be viewed either as vectors
of very high dimensionality or as sets. The seminal work of
minwise hashing [5] is a standard tool for efficiently
computing resemblances (Jaccard similarity) among extremely
high-dimensional (e.g., $2^{64}$) binary vectors, which may be
documents represented by w-shingles (w-grams, w contiguous
words) with w = 5 or 7. Minwise hashing has been successfully
applied to a very wide range of real-world problems,
especially in the context of search; a partial

list includes IS SI EH IZ1 123 IS1 HZI QQ1 151 QS1 

[25]. 

The resemblance, R, is a widely used measure of similarity
between two sets. Consider two sets
$S_1, S_2 \subseteq \Omega = \{0, 1, 2, \ldots, D-1\}$, where D, the
size of the dictionary, is often set to be $D = 2^{64}$ in
industry practice. Denote $a = |S_1 \cap S_2|$. R is defined as

\[
R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{a}{f_1 + f_2 - a},
\qquad f_1 = |S_1|, \quad f_2 = |S_2|.
\]



Minwise hashing applies a random permutation
$\pi : \Omega \to \Omega$ on $S_1$ and $S_2$. Based on an elementary
probability result:

\[
\Pr\left(\min(\pi(S_1)) = \min(\pi(S_2))\right)
= \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R, \tag{1}
\]

one can store the smallest elements under $\pi$, i.e.,
$\min(\pi(S_1))$ and $\min(\pi(S_2))$, and then repeat the
permutation k times to estimate R. After k minwise independent
permutations, $\pi_1, \pi_2, \ldots, \pi_k$, one can estimate R
without bias, as:

\[
\hat{R}_M = \frac{1}{k} \sum_{j=1}^{k}
1\{\min(\pi_j(S_1)) = \min(\pi_j(S_2))\}, \tag{2}
\]

\[
\mathrm{Var}\left(\hat{R}_M\right) = \frac{1}{k} R(1 - R). \tag{3}
\]

1 First draft in March, slightly modified in June, 2011. 



The common practice is to store each hashed value,
e.g., $\min(\pi(S_1))$ and $\min(\pi(S_2))$, using 64 bits [12].
The storage cost (and consequently the computational
cost) will be prohibitive in large-scale applications [24].

It is well understood in practice that one can reliably
replace a permutation with a reasonable hashing function;
see the original minwise hashing paper [2] and the
follow-up theoretical work [3]. In other words, there is
no need to store these k permutations.
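The estimator (2) can be illustrated with a small simulation. The Python sketch below is not from the paper: the function name `minhash_resemblance`, the toy sets, and the use of explicit shuffled permutations (rather than hash functions) are assumptions chosen so that the example stays readable at a small D.

```python
import random

def minhash_resemblance(s1, s2, D, k, seed=0):
    """Estimate R = |S1 & S2| / |S1 | S2| as in (2), using k random
    permutations of {0, ..., D-1} (illustrative; real systems hash)."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(k):
        perm = list(range(D))
        rng.shuffle(perm)                 # perm[x] is the image of x under pi_j
        z1 = min(perm[x] for x in s1)     # min(pi_j(S1))
        z2 = min(perm[x] for x in s2)     # min(pi_j(S2))
        matches += (z1 == z2)
    return matches / k

# Toy sets (assumed for illustration): |S1 & S2| = 30, |S1 | S2| = 100.
s1 = set(range(0, 60))
s2 = set(range(30, 100))
true_r = len(s1 & s2) / len(s1 | s2)
est = minhash_resemblance(s1, s2, D=200, k=2000)
print(true_r, est)
```

With k = 2000 the standard deviation predicted by (3) is about 0.01, so the estimate should land close to the true resemblance.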



In this paper, we first observe that the standard practice of
minwise hashing, i.e., using (2), can be substantially
improved for important scenarios. In fact, we will show
that (2) is optimal only when the sets are of the same
size, i.e., $f_1 = f_2$, which is not too common in practice.

Figure 1 presents an example based on the webspam
dataset (available from the LibSVM site), which contains
350000 documents represented using binary vectors
of D = 16 million dimensions. Compared to Web-scale
datasets with billions of documents in $2^{64}$ dimensions,
webspam is relatively small and only uses 3-grams.
Nevertheless, this example demonstrates that the set sizes
(numbers of non-zeros), $f_i = |S_i|$, are distributed over a
wide range. Therefore, when we compare two sets, say $S_1$
and $S_2$, we expect that the ratio $f_1/f_2$ will often deviate
significantly from 1.

Figure 1: Histograms of $f_i$ (number of non-zeros in the
i-th data vector) in webspam dataset.

Indeed, we computed the ratios $f_1/f_2$ for all pairs in
webspam. Without loss of generality, we always assume
$f_1 > f_2$. There are altogether 61 billion pairs
with the mean $f_1/f_2 = 5.5$ and the standard deviation
(std) = 9.5. Thus, we expect that $f_2/f_1$ = 0.2 to 0.5
is common and $f_2/f_1 < 0.1$ is also fairly frequent.

1.1 The 3-Cell Multinomial Problem 

The standard estimator (2) is based on a binomial
distribution. However, the problem really follows a 3-cell
multinomial distribution. Define $z_1 = \min(\pi(S_1))$ and
$z_2 = \min(\pi(S_2))$. The three probabilities are:

\[
P_= = \Pr(z_1 = z_2) = \frac{a}{f_1 + f_2 - a} = R, \tag{4}
\]
\[
P_< = \Pr(z_1 < z_2) = \frac{f_1 - a}{f_1 + f_2 - a}, \tag{5}
\]
\[
P_> = \Pr(z_1 > z_2) = \frac{f_2 - a}{f_1 + f_2 - a}. \tag{6}
\]

These probabilities are easy to understand. For example,
for the event $\{z_1 < z_2\}$, the size of the sample space
is $|S_1 \cup S_2| = f_1 + f_2 - a$ and the size of the event
space is $|S_1 - S_1 \cap S_2| = f_1 - a$, and hence
$P_< = \frac{f_1 - a}{f_1 + f_2 - a}$.



We will show that the estimator solely based on $P_=$
(4) is optimal only when $f_1 = f_2$. Assuming $f_1 > f_2$,
then (5) should not be used for the estimation task.
The estimator based on $P_>$ (6) is superior to the one
based on $P_=$ (4) when $f_1 > f_2 \approx a$. However, since we
do not know a in advance, we must combine all three
probabilities to ensure accurate estimates.
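As a rough illustration of the 3-cell view, the following Python sketch counts how often each of the three events occurs over k random permutations and compares the frequencies with (4), (5), and (6). The helper names `three_cell_counts` and `three_cell_probs` and the toy sets are assumptions made for this example only.

```python
import random

def three_cell_counts(s1, s2, D, k, seed=0):
    """Count the events z1 == z2, z1 < z2, z1 > z2 over k random permutations."""
    rng = random.Random(seed)
    k_eq = k_lt = k_gt = 0
    for _ in range(k):
        perm = list(range(D))
        rng.shuffle(perm)
        z1 = min(perm[x] for x in s1)
        z2 = min(perm[x] for x in s2)
        if z1 == z2:
            k_eq += 1
        elif z1 < z2:
            k_lt += 1
        else:
            k_gt += 1
    return k_eq, k_lt, k_gt

def three_cell_probs(f1, f2, a):
    """Probabilities (4), (5), (6): P_=, P_<, P_>."""
    u = f1 + f2 - a                     # |S1 union S2|
    return a / u, (f1 - a) / u, (f2 - a) / u

# Toy sets (assumed): f1 = 80, f2 = 40, a = 20, so P = (0.2, 0.6, 0.2).
s1 = set(range(0, 80))
s2 = set(range(60, 100))
k = 2000
counts = three_cell_counts(s1, s2, D=150, k=k)
probs = three_cell_probs(len(s1), len(s2), len(s1 & s2))
print([c / k for c in counts], probs)
```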

1.2 The Measure of Containment 

The ratio $T = a/f_2$ (assuming $f_1 > f_2$) is known as
the containment. It is possible that the resemblance
R is small but the containment T is large. Note that
$R = \frac{a}{f_1 + f_2 - a} \le \frac{a}{f_1} \le \frac{f_2}{f_1}$.
Thus, if, for example, $f_2/f_1 < 0.2$, then R has to be small,
even when $a \approx f_2$ (which corresponds to $T \approx 1$).
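A small numeric example makes the gap between the two measures concrete. The Python sketch below uses made-up set sizes (an assumption, not data from the paper); `containment` divides by the smaller set size, which matches $T = a/f_2$ when $f_1 > f_2$.

```python
def resemblance(f1, f2, a):
    """R = a / (f1 + f2 - a)."""
    return a / (f1 + f2 - a)

def containment(f1, f2, a):
    """T = a / min(f1, f2); equals a / f2 under the convention f1 > f2."""
    return a / min(f1, f2)

# Hypothetical sizes: a small set almost entirely contained in a large one.
f1, f2, a = 1000, 100, 95
R = resemblance(f1, f2, a)
T = containment(f1, f2, a)
print(R, T)   # R stays below f2/f1 = 0.1 even though T is close to 1
```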

While the literature on minwise hashing has mainly 
focused on the estimation of set resemblance, accu- 
rate estimation of set containment is also crucial to 
a number of different applications. For example, [3] 
uses both resemblance and containment estimates of 
the w-grams contained in text columns to characterize 
the similarity of database table contents in a tool that 
allows users to quickly understand database content. 
In a similar context, [3T] tests the (estimated) level of 
containment between the distinct values contained in 
different (sets of) database columns to automatically 
detect foreign key constraints. [25] describes the use 
of (estimated) shingle containment in the context of 
cluster-based compression schemes. In the context of 
overlay networks, [5] uses the estimated containment 
(and resemblance) of the working sets of peers to co- 
ordinate between them, in turn reducing communica- 
tion cost and complexity; because only small messages 
should be passed for coordination, this estimation has 
to be based on small synopses. The use of containment 
estimates in the context of peer-to-peer networking is 
discussed in [15] . 

1.3 b-Bit Minwise Hashing 

The recent development of b-bit minwise hashing |22[ 
|2"31 |2"U1 I2T] provides a solution to the (storage and 
computational) problem of minwise hashing by stor- 
ing only the lowest b bits (instead of 64 bits) of each 
hashed value for a small b. [21] proved that using 
only b = 1 bit per hashed value can achieve at least
a 21.3-fold improvement (in terms of storage) com- 
pared to using b = 64 bits if the target resemblance 
R > 0.5. This is a very encouraging result which may 
lead to substantial improvement in applications like 
(near)duplicate detection of Web pages [2]. 

On the other hand, when R is small, as shown in
[21], one might have to increase b in order to achieve
an adequate accuracy without substantially increasing
k, the number of permutations.



In fact, machine learning algorithms like SVM require 
(essentially) all pairwise similarities and it is expected 
that most pairs are not too similar. Our concurrent 
work [3T] attempts to combine linear SVM [T5I |2"51 |TT1 
1161130) with b-bit hashing; and our initial experiments 
suggest that b > 4 (especially b = 8) is needed to
achieve good performance. 

In this paper, we will provide estimators for both the 
standard minwise hashing and b-bit minwise hashing. 

2 Estimators for Minwise Hashing 

Consider two sets $S_1, S_2 \subseteq \Omega = \{0, 1, 2, \ldots, D-1\}$,
with $f_1 = |S_1|$, $f_2 = |S_2|$, and $a = |S_1 \cap S_2|$. We apply k
random permutations $\pi_j : \Omega \to \Omega$, and record the minimums
$z_{1,j} = \min(\pi_j(S_1))$, $z_{2,j} = \min(\pi_j(S_2))$, j = 1 to k. We
will utilize the sizes of three disjoint sets:

\[
k_= = |\{z_{1,j} = z_{2,j},\ j = 1, 2, \ldots, k\}| \tag{7}
\]
\[
k_< = |\{z_{1,j} < z_{2,j},\ j = 1, 2, \ldots, k\}| \tag{8}
\]
\[
k_> = |\{z_{1,j} > z_{2,j},\ j = 1, 2, \ldots, k\}| \tag{9}
\]

Note that $E(k_=) = kP_=$, $E(k_<) = kP_<$, $E(k_>) = kP_>$,
$\mathrm{Var}(k_=) = kP_=(1 - P_=)$, etc. Thus, $k_=/k$, $k_</k$, and
$k_>/k$ are unbiased estimators of $P_=$ (4), $P_<$ (5), and $P_>$
(6), respectively. For the convenience of presentation,
we estimate the intersection $a = (f_1 + f_2)\frac{R}{1+R}$:

\[
\hat{a}_= = (f_1 + f_2)\frac{k_=/k}{1 + k_=/k}
= \frac{(f_1 + f_2)\,k_=}{k + k_=} \tag{10}
\]
\[
\hat{a}_< = f_1 - f_2\,\frac{k_<}{k - k_<} \tag{11}
\]
\[
\hat{a}_> = f_2 - f_1\,\frac{k_>}{k - k_>} \tag{12}
\]

which are asymptotically (for large k) unbiased estimators
of a. The variances are provided by Lemma 1.


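Lemma 1

\[
\mathrm{Var}(\hat{a}_=) = \frac{1}{k}\,
\frac{(f_1 + f_2 - a)^2\, a\, (f_1 + f_2 - 2a)}{(f_1 + f_2)^2}
+ O\left(\frac{1}{k^2}\right) \tag{13}
\]
\[
\mathrm{Var}(\hat{a}_<) = \frac{1}{k}\,
\frac{(f_1 + f_2 - a)^2\, (f_1 - a)}{f_2}
+ O\left(\frac{1}{k^2}\right) \tag{14}
\]
\[
\mathrm{Var}(\hat{a}_>) = \frac{1}{k}\,
\frac{(f_1 + f_2 - a)^2\, (f_2 - a)}{f_1}
+ O\left(\frac{1}{k^2}\right) \tag{15}
\]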

Proof: The asymptotic variances can be computed by
the "delta method", $\mathrm{Var}(g(x)) \approx \mathrm{Var}(x)\,[g'(E(x))]^2$, in
a straightforward fashion. We skip the details. $\Box$
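The three estimators (10), (11), (12) and the leading variance terms of Lemma 1 translate directly into code. The Python sketch below is illustrative only: the function names and the numeric set sizes are assumptions, and the counts are set to their expected values $kP_=$, $kP_<$, $kP_>$ so that all three estimators recover a exactly.

```python
def a_hat_eq(f1, f2, k, k_eq):
    """Estimator (10), based on the count of ties z1 == z2."""
    return (f1 + f2) * k_eq / (k + k_eq)

def a_hat_lt(f1, f2, k, k_lt):
    """Estimator (11), based on the count of z1 < z2."""
    return f1 - f2 * k_lt / (k - k_lt)

def a_hat_gt(f1, f2, k, k_gt):
    """Estimator (12), based on the count of z1 > z2."""
    return f2 - f1 * k_gt / (k - k_gt)

def asymptotic_variances(f1, f2, a, k):
    """Leading O(1/k) terms of (13), (14), (15)."""
    u = f1 + f2 - a
    var_eq = u ** 2 * a * (f1 + f2 - 2 * a) / (f1 + f2) ** 2 / k
    var_lt = u ** 2 * (f1 - a) / f2 / k
    var_gt = u ** 2 * (f2 - a) / f1 / k
    return var_eq, var_lt, var_gt

# Hypothetical low-resemblance, high-containment pair.
f1, f2, a, k = 1000, 100, 90, 1000
u = f1 + f2 - a
k_eq, k_lt, k_gt = k * a / u, k * (f1 - a) / u, k * (f2 - a) / u  # expected counts
estimates = (a_hat_eq(f1, f2, k, k_eq),
             a_hat_lt(f1, f2, k, k_lt),
             a_hat_gt(f1, f2, k, k_gt))
var_eq, var_lt, var_gt = asymptotic_variances(f1, f2, a, k)
print(estimates, var_eq, var_lt, var_gt)
```

For this pair the variance of the estimator based on $k_>$ is several times smaller than that of the standard estimator, while the estimator based on $k_<$ is far worse.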

2.1 The Maximum Likelihood Estimator 

Lemma 1 suggests that the current standard estimator
$\hat{a}_=$ may be far from optimal when $f_2/f_1$ deviates
from 1. In fact, if we know $f_1 > f_2 \approx a$ (i.e., when the
resemblance is small but the containment is large), we
will obtain good results by using $\hat{a}_>$. The problem is
that we do not know a in advance and hence we should
resort to the maximum likelihood estimator (MLE).

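Lemma 2 The MLE, denoted by $\hat{a}_{MLE}$, is the solution
to the following equation:

\[
k_=\,\frac{f_1 + f_2}{a} - k_<\,\frac{f_2}{f_1 - a}
- k_>\,\frac{f_1}{f_2 - a} = 0, \tag{16}
\]

which is asymptotically unbiased with the variance

\[
\mathrm{Var}(\hat{a}_{MLE}) = \frac{1}{k}\,
\frac{(f_1 + f_2 - a)^2}
{\frac{f_1 + f_2}{a} + \frac{f_2}{f_1 - a} + \frac{f_1}{f_2 - a}}
+ O\left(\frac{1}{k^2}\right). \tag{17}
\]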



Proof: The result follows from classical multinomial
estimation theory. See Section 3.1. $\Box$
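Because the left-hand side of the MLE equation (16) is strictly decreasing in a on $(0, \min(f_1, f_2))$, a simple bisection is one way to solve it. The Python sketch below is a minimal illustration under that observation, not the paper's implementation; the names `mle_intersection` and `mle_variance` and the numeric inputs are assumptions.

```python
def mle_intersection(f1, f2, k_eq, k_lt, k_gt, iters=200):
    """Solve the MLE equation (16) for a by bisection on (0, min(f1, f2))."""
    def score(a):
        return (k_eq * (f1 + f2) / a
                - k_lt * f2 / (f1 - a)
                - k_gt * f1 / (f2 - a))
    lo, hi = 0.0, float(min(f1, f2))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid <= lo or mid >= hi:      # interval collapsed in floating point
            break
        if score(mid) > 0:              # score is decreasing: root lies above mid
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mle_variance(f1, f2, a, k):
    """Leading O(1/k) term of (17)."""
    u = f1 + f2 - a
    return u ** 2 / ((f1 + f2) / a + f2 / (f1 - a) + f1 / (f2 - a)) / k

# Hypothetical pair; counts set to their expected values k*P.
f1, f2, a, k = 1000, 100, 90, 1000
u = f1 + f2 - a
a_mle = mle_intersection(f1, f2, k * a / u, k * (f1 - a) / u, k * (f2 - a) / u)
var_mle = mle_variance(f1, f2, a, k)
print(a_mle, var_mle)
```

If no tie is observed ($k_= = 0$) the search converges to the lower boundary, and if the score stays positive it converges to $\min(f_1, f_2)$; both are boundary solutions of the likelihood rather than roots of (16).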



2.2 Comparing MLE with Other Estimators 

Figure 2 compares the ratios of the variances of estimators
of a (only using the $O\left(\frac{1}{k}\right)$ term of the variance).
The top-left panel illustrates that when $f_2/f_1 < 0.5$
(which is common), the MLE $\hat{a}_{MLE}$ can reduce the
variance of the standard estimator $\hat{a}_=$ by a large factor.
When the target containment $T = a/f_2$ approaches
1, the improvement can be as large as 100-fold.

The top-right panel of Figure 2 suggests that, if $f_2 < f_1$,
then we should not use $\hat{a}_<$, because its variance can
be orders of magnitude larger than the variance of the MLE.
The bottom-left panel confirms that if we know the
containment is very large (close to 1), then we will
do well by using $\hat{a}_>$, which is simpler than the MLE.
The problem is of course that we do not know a in
advance and hence we may still have to use the MLE.
The bottom-right panel verifies that $\hat{a}_>$ is significantly
better than $\hat{a}_<$.



2.3 Experiment



For the purpose of verifying the theoretical improvements,
we use two pairs of sets corresponding to the
occurrences of four common words ("A - TEST" and
"THIS - PERSON") in a chunk of real-world Web
crawl data. Each (word) set is a set of document
(Web page) IDs which contained that word at least
once. For "A - TEST", the resemblance = 0.0524 and
containment = 0.9043. For "THIS - PERSON", the
resemblance = 0.0903 and containment = 0.8440.

Figure 3 presents the mean square errors (MSE) of the
estimates using $\hat{a}_=$ and $\hat{a}_{MLE}$. The results verify our
theoretical predictions:

Figure 2: Variance ratios (the lower the better). Except
in the bottom-right panel, we compare the other
three estimators, $\hat{a}_=$, $\hat{a}_<$ and $\hat{a}_>$, with the MLE $\hat{a}_{MLE}$.



• For pairs of low resemblance and high containment,
the MLE $\hat{a}_{MLE}$ provides significantly better
(in these two cases, about an order of magnitude
better) results than the standard estimator
$\hat{a}_=$.

• The MLE is asymptotically unbiased. The small 
bias at small k (which is common for MLE in gen- 
eral) vanishes as k increases. 

• The theoretical variances match the simulations. 




We first define:

$u_{1,b}$ = the number formed by the lowest b bits of $z_1$

$u_{2,b}$ = the number formed by the lowest b bits of $z_2$.

[22] derived the probability formula $\Pr(u_{1,b} = u_{2,b})$ by
assuming $D = |\Omega|$ is large (which is virtually always
satisfied in practice). We will also need to derive

\[
P_{b,(t,d)} = \Pr(u_{1,b} = t,\ u_{2,b} = d), \qquad
t, d = 0, 1, 2, \ldots, 2^b - 1.
\]

We follow the convention in [22] by defining

\[
r_1 = \frac{f_1}{D}, \qquad r_2 = \frac{f_2}{D}, \qquad s = \frac{a}{D}. \tag{18}
\]



Instead of estimating a, we equivalently estimate s in
the context of b-bit hashing. Lemma 3 provides the
probability formulas as the basic tool.



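Lemma 3 Assume D is very large. With $P_=$, $P_<$, and $P_>$
as in (4), (5), and (6),

\[
\begin{aligned}
&\Pr(u_{1,b} = t,\ u_{2,b} = d,\ t < d) \\
&= P_<\,\frac{r_2\,[1 - r_2]^{d-t-1}}{1 - [1 - r_2]^{2^b}}\;
\frac{[1 - (r_1 + r_2 - s)]^{t}\,(r_1 + r_2 - s)}{1 - [1 - (r_1 + r_2 - s)]^{2^b}} \\
&\quad + P_>\,\frac{r_1\,[1 - r_1]^{t + 2^b - d - 1}}{1 - [1 - r_1]^{2^b}}\;
\frac{[1 - (r_1 + r_2 - s)]^{d}\,(r_1 + r_2 - s)}{1 - [1 - (r_1 + r_2 - s)]^{2^b}}
\end{aligned} \tag{19}
\]

\[
\begin{aligned}
&\Pr(u_{1,b} = t,\ u_{2,b} = d,\ t > d) \\
&= P_>\,\frac{r_1\,[1 - r_1]^{t-d-1}}{1 - [1 - r_1]^{2^b}}\;
\frac{[1 - (r_1 + r_2 - s)]^{d}\,(r_1 + r_2 - s)}{1 - [1 - (r_1 + r_2 - s)]^{2^b}} \\
&\quad + P_<\,\frac{r_2\,[1 - r_2]^{d + 2^b - t - 1}}{1 - [1 - r_2]^{2^b}}\;
\frac{[1 - (r_1 + r_2 - s)]^{t}\,(r_1 + r_2 - s)}{1 - [1 - (r_1 + r_2 - s)]^{2^b}}
\end{aligned} \tag{20}
\]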

Figure 3: A simulation study using two pairs of real-world
vectors (of low resemblance and high containment)
to verify (i) $\hat{a}_{MLE}$ is significantly better than
$\hat{a}_=$; and (ii) the theoretical variances match the
simulations and the bias of the MLE vanishes as k increases.
(Horizontal axis of each panel: sample size k.)

3 b-Bit Minwise Hashing

b-Bit minwise hashing [35] stores each hashed value,
e.g., $z_1 = \min(\pi(S_1))$, $z_2 = \min(\pi(S_2))$, using the
lowest b bits instead of 64 bits. In this section, we will
show that because the original b-bit minwise hashing
only used part of the available information, it can be
substantially improved.



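\[
\begin{aligned}
&\Pr(u_{1,b} = t,\ u_{2,b} = t) \\
&= \left( R + P_<\,\frac{r_2\,[1 - r_2]^{2^b - 1}}{1 - [1 - r_2]^{2^b}}
+ P_>\,\frac{r_1\,[1 - r_1]^{2^b - 1}}{1 - [1 - r_1]^{2^b}} \right)
\frac{[1 - (r_1 + r_2 - s)]^{t}\,(r_1 + r_2 - s)}{1 - [1 - (r_1 + r_2 - s)]^{2^b}}
\end{aligned} \tag{21}
\]

Proof: See Appendix A. $\Box$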

Therefore, we encounter a multinomial probability
estimation problem with each cell probability being a
function of s. Note that the total number of cells, i.e.,
$2^b \times 2^b$, is large, especially when b is not small.
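To make the cell probabilities concrete, the following Python sketch fills the full $2^b \times 2^b$ table from the three cases of Lemma 3 (t < d, t > d, and t = d), written in terms of $r_1$, $r_2$, s and $P_=$, $P_<$, $P_>$. The function name `bbit_cell_probs` and the parameter values are assumptions for illustration; the formulas rely on the large-D approximation stated in the lemma.

```python
def bbit_cell_probs(r1, r2, s, b):
    """Return P[t][d] = Pr(u1 = t, u2 = d) for t, d in {0, ..., 2^b - 1},
    following the three cases of Lemma 3 (large-D approximation)."""
    n = 2 ** b
    u = r1 + r2 - s                              # |S1 union S2| / D
    p_eq, p_lt, p_gt = s / u, (r1 - s) / u, (r2 - s) / u
    q, q1, q2 = 1 - u, 1 - r1, 1 - r2

    def g(x):                                    # [1-u]^x u / (1 - [1-u]^(2^b))
        return q ** x * u / (1 - q ** n)

    c1 = r1 / (1 - q1 ** n)
    c2 = r2 / (1 - q2 ** n)
    P = [[0.0] * n for _ in range(n)]
    for t in range(n):
        for d in range(n):
            if t < d:
                P[t][d] = (p_lt * c2 * q2 ** (d - t - 1) * g(t)
                           + p_gt * c1 * q1 ** (t + n - d - 1) * g(d))
            elif t > d:
                P[t][d] = (p_gt * c1 * q1 ** (t - d - 1) * g(d)
                           + p_lt * c2 * q2 ** (d + n - t - 1) * g(t))
            else:
                P[t][t] = (p_eq + p_lt * c2 * q2 ** (n - 1)
                           + p_gt * c1 * q1 ** (n - 1)) * g(t)
    return P

# Hypothetical parameters.
P = bbit_cell_probs(r1=0.3, r2=0.2, s=0.1, b=2)
total = sum(sum(row) for row in P)
print(total)
```

The table has $4^b$ entries, so this direct construction is only practical for small b; it is meant as a reference against which collapsed forms can be checked.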

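In addition to $P_{b,(t,d)}$, we also define the following three
probability summaries analogous to $P_=$, $P_<$, and $P_>$:

\[
P_{b,=} = \Pr(u_{1,b} = u_{2,b}) = \sum_{t=0}^{2^b - 1} P_{b,(t,t)}
\]
\[
P_{b,<} = \Pr(u_{1,b} < u_{2,b}) = \sum_{t<d} P_{b,(t,d)}
\]
\[
P_{b,>} = \Pr(u_{1,b} > u_{2,b}) = \sum_{t>d} P_{b,(t,d)}
\]

Fisher Information $I(\theta) = -E(l''(\theta))$:

\[
I(\theta) = -E(l''(\theta))
= -\sum_{i=1}^{m} E(k_i)\,\left[\log q_i(\theta)\right]''
= k \sum_{i=1}^{m} \frac{[q_i'(\theta)]^2}{q_i(\theta)}, \tag{24}
\]

\[
\mathrm{Var}\left(\hat{\theta}_{MLE}\right) = \frac{1}{I(\theta)}
+ O\left(\frac{1}{k^2}\right). \tag{25}
\]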
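The Fisher information of a one-parameter multinomial, $k \sum_i [q_i'(\theta)]^2 / q_i(\theta)$, can be approximated numerically when derivatives of the cell probabilities are inconvenient to write down. The Python sketch below uses central differences (an assumption of this example, as are the names `fisher_information` and `three_cells` and the numeric values) and applies it to the 3-cell problem of Section 1.1 with $\theta = a$.

```python
def fisher_information(cell_probs, theta, k, h=1e-4):
    """Approximate I(theta) = k * sum_i q_i'(theta)^2 / q_i(theta),
    with q_i' estimated by central differences of step h."""
    q = cell_probs(theta)
    q_hi = cell_probs(theta + h)
    q_lo = cell_probs(theta - h)
    total = 0.0
    for p, hi, lo in zip(q, q_hi, q_lo):
        dq = (hi - lo) / (2 * h)
        total += dq * dq / p
    return k * total

def three_cells(a, f1=1000.0, f2=100.0):
    """Cells (P_=, P_<, P_>) of the 3-cell problem as functions of a."""
    u = f1 + f2 - a
    return [a / u, (f1 - a) / u, (f2 - a) / u]

k = 1000
info = fisher_information(three_cells, 90.0, k)
asymptotic_var = 1.0 / info      # leading term of the MLE variance
print(asymptotic_var)
```

For these hypothetical sizes the result can be compared with the closed-form variance of the intersection MLE stated in Lemma 2.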



Suppose we conduct k permutations. We define the
observed counts, $k_{b,(t,d)}$, $k_{b,=}$, $k_{b,<}$, and $k_{b,>}$, which
correspond to $P_{b,(t,d)}$, $P_{b,=}$, $P_{b,<}$, and $P_{b,>}$,
respectively. Note that $k = \sum_{t,d} k_{b,(t,d)}$.

[22] only used $P_{b,=}$ to estimate R (and hence also
s). We expect to achieve substantial improvement if
we can take advantage of the matrix of probabilities
$P_{b,(t,d)}$. Here, we first review some basic statistical
procedures for multinomial estimation and the classical
(asymptotic) variance analysis.



For b-bit hashing, since we have $2^b \times 2^b$ cells with
probabilities $P_{b,(t,d)}$, we can either use the full (entire)
probability matrix or various reduced forms by grouping
(collapsing) cells (e.g., $P_{b,=}$, $P_{b,<}$, and $P_{b,>}$) to ease
the burden of numerically solving the MLE equation
(23).



3.2 Five Levels of Estimators for s 

We first introduce the notation for the following five 
estimators of s: 



3.1 Review Classical Multinomial Estimation 

Consider a table with m cells, each of which is associated
with a probability $q_i(\theta)$, i = 1, 2, ..., m. Here
we assume the probability $q_i$ is parameterized by $\theta$
(for example, the s in our problem), and the task is
to estimate $\theta$. Suppose we draw k i.i.d. samples and
the number of observations from the i-th cell is $k_i$,
$\sum_{i=1}^{m} k_i = k$. The joint log-likelihood is proportional
to

\[
l(\theta) = \sum_{i=1}^{m} k_i \log q_i(\theta). \tag{22}
\]

The maximum likelihood estimator (MLE), which is
optimal or asymptotically (for large k) optimal in
terms of the variance, is the solution $\hat{\theta}_{MLE}$ to the MLE
equation $l'(\theta) = 0$, i.e.,

\[
l'(\theta) = \sum_{i=1}^{m} k_i\,\frac{q_i'(\theta)}{q_i(\theta)} = 0, \tag{23}
\]



1. $\hat{s}_{b,f}$ denotes the full MLE solution by using all
$m = 2^b \times 2^b$ cell probabilities $P_{b,(t,d)}$,
$t, d = 0, 1, 2, \ldots, 2^b - 1$. This estimator will be most
accurate and computationally most intensive.

2. $\hat{s}_{b,do}$ denotes the MLE solution by using $m = 2^b + 2$
cells which include the $2^b$ diagonal probabilities
$P_{b,(t,t)}$, $t = 0, 1, \ldots, 2^b - 1$, and two summaries of
the off-diagonals: $P_{b,<} = \sum_{t<d} P_{b,(t,d)}$ and
$P_{b,>} = \sum_{t>d} P_{b,(t,d)}$.

3. $\hat{s}_{b,d}$ denotes the MLE solution by using $m = 2^b + 1$
cells which include the $2^b$ diagonal probabilities
$P_{b,(t,t)}$, $t = 0, 1, \ldots, 2^b - 1$, and the sum of the rest,
i.e., $P_{b,<} + P_{b,>}$.

4. $\hat{s}_{b,3}$ denotes the MLE solution by using m = 3
cells which include the sum of the diagonals
and two sums of the off-diagonals, i.e.,
$P_{b,=} = \sum_{t=0}^{2^b - 1} P_{b,(t,t)}$, $P_{b,<}$, and $P_{b,>}$.

5. $\hat{s}_{b,=}$ denotes the MLE solution by using only m = 2
cells, i.e., $P_{b,=}$ and $1 - P_{b,=}$. This estimator
requires no numerical solutions and is the one used
in the original b-bit minwise hashing paper [22].
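The five levels differ only in how the cells of the $2^b \times 2^b$ table are grouped. The Python sketch below shows one way to collapse a full table (of probabilities or of observed counts) into each reduced form; the function name `collapse` and the level labels are assumptions introduced for this example.

```python
def collapse(P, level):
    """Group the cells of a square table P (list of rows) for one of the
    five estimator levels: 'f', 'do', 'd', '3', or '='."""
    n = len(P)
    diag = [P[t][t] for t in range(n)]
    lt = sum(P[t][d] for t in range(n) for d in range(n) if t < d)
    gt = sum(P[t][d] for t in range(n) for d in range(n) if t > d)
    if level == "f":                  # all 2^b x 2^b cells
        return [p for row in P for p in row]
    if level == "do":                 # diagonals + two off-diagonal sums
        return diag + [lt, gt]
    if level == "d":                  # diagonals + one off-diagonal sum
        return diag + [lt + gt]
    if level == "3":                  # sum of diagonals + two off-diagonal sums
        return [sum(diag), lt, gt]
    if level == "=":                  # sum of diagonals + everything else
        return [sum(diag), lt + gt]
    raise ValueError("unknown level: " + level)

# A uniform 4 x 4 table (b = 2), used only to show the cell counts.
n = 4
uniform = [[1.0 / (n * n)] * n for _ in range(n)]
sizes = {lvl: len(collapse(uniform, lvl)) for lvl in ("f", "do", "d", "3", "=")}
print(sizes)
```

The same grouping applied to the observed count matrix gives the counts that enter the log-likelihood of the corresponding reduced estimator.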



Solving (23) often requires a numerical procedure.
For one-dimensional problems as in our case, the
numerical procedure is straightforward.

The estimation variance of $\hat{\theta}_{MLE}$ is related to the
Fisher information $I(\theta) = -E(l''(\theta))$; see (24) and (25).



We compare the asymptotic variances of the other four
estimators, $\hat{s}_{b,do}$, $\hat{s}_{b,d}$, $\hat{s}_{b,3}$, and $\hat{s}_{b,=}$, with the variance
of the full MLE $\hat{s}_{b,f}$ in Figures 4 to 12. We consider
b = 8, 4, 6, $r_1$ = 0.8, 0.5, 0.2 and the full ranges of
$r_2/r_1$ and $s/r_2$ (which is the containment). Note that the
improvement of this paper compared to the previous
standard practice (i.e., $\hat{s}_{b,=}$) is only reflected in the
bottom-right panel of each figure. We present other
estimators in the hope of finding one which is much
simpler than the full MLE $\hat{s}_{b,f}$ and still retains much
of the improvement. Our observations are:

• The full MLE $\hat{s}_{b,f}$, which uses a matrix of $2^b \times 2^b$
probabilities, can achieve substantial improvements
(for example, 5- to 100-fold) compared to
the standard practice $\hat{s}_{b,=}$, especially for cases of
low resemblance and high containment.

• Two other estimators, $\hat{s}_{b,do}$ and $\hat{s}_{b,3}$, usually
perform very well compared to the full MLE. $\hat{s}_{b,do}$
uses $2^b + 2$ cells and $\hat{s}_{b,3}$ uses merely 3 cells: the
sum of the diagonals and the two sums of the
off-diagonals. Therefore, we consider $\hat{s}_{b,3}$ likely to
be particularly useful in practice.
Figure 4: Given a contingency table of size $2^b \times 2^b$,
we have defined five different estimators. We compare
the theoretical variances of the other four estimators,
$\hat{s}_{b,do}$, $\hat{s}_{b,d}$, $\hat{s}_{b,3}$, and $\hat{s}_{b,=}$, with the variance of the full
MLE $\hat{s}_{b,f}$. The bottom-right panel measures the real
improvement of this paper compared to the previous
standard practice (i.e., $\hat{s}_{b,=}$). In this case, the variance
ratios of about 10 to 100 are very substantial.
The other three panels are for testing whether simpler
estimators can still achieve substantial improvements.
For example, both $\hat{s}_{b,do}$ and $\hat{s}_{b,3}$ (the left two panels)
only magnify the variance of the full MLE by small factors
compared to $\hat{s}_{b,=}$, and hence they might be good
estimators in lieu of the quite sophisticated full MLE
solution. In this figure, we consider b = 8 and $r_1$ = 0.8.
Note that $s/r_2$ is the containment. The resemblance
is upper bounded by $r_2/r_1$.

Figure 5: Variance ratios for b = 8 and $r_1$ = 0.5. See
the caption of Figure 4 for more details.

Figure 6: Variance ratios for b = 8 and $r_1$ =

Figure 7: Variance ratios for b = 4 and $r_1$
Our analysis has demonstrated that the much simpler
estimator $\hat{s}_{b,3}$, which only uses 3 cells, is often remarkably
accurate. We expect it will be used in practice.
$\hat{s}_{b,3}$ involves three summary probabilities: $P_{b,=}$, $P_{b,<}$,
and $P_{b,>}$. For efficient estimation, we will need to
use more compact representations instead of the double
summation forms. Since we already know $P_{b,=}$ as derived
in [22], we only need to derive $P_{b,<}$, and then $P_{b,>}$
follows by symmetry. After some algebra, we obtain

\[
\begin{aligned}
P_{b,<} &= \Pr(u_{1,b} < u_{2,b}) = \sum_{t<d} \Pr(u_{1,b} = t,\ u_{2,b} = d) \\
&= \frac{r_1 - s}{\left(1 - [1 - r_2]^{2^b}\right)\left(1 - [1 - (r_1 + r_2 - s)]^{2^b}\right)}
\left\{ \frac{1 - [1 - (r_1 + r_2 - s)]^{2^b - 1}}{r_1 + r_2 - s}
- \frac{[1 - r_2]^{2^b} - [1 - (r_1 + r_2 - s)]^{2^b - 1}[1 - r_2]}{r_1 - s} \right\} \\
&\quad + \frac{r_2 - s}{\left(1 - [1 - r_1]^{2^b}\right)\left(1 - [1 - (r_1 + r_2 - s)]^{2^b}\right)}
\left\{ \frac{[1 - r_1]^{2^b} - [1 - (r_1 + r_2 - s)]^{2^b}}{r_2 - s}
- [1 - r_1]^{2^b - 1}
- [1 - r_1]^{2^b - 1}\,\frac{[1 - (r_1 + r_2 - s)] - [1 - (r_1 + r_2 - s)]^{2^b}}{r_1 + r_2 - s} \right\}.
\end{aligned}
\]


4 Conclusion 

Computing set or vector similarity is a routine task 
in numerous applications in machine learning, infor- 
mation retrieval, and databases. In Web scale ap- 
plications, the method of minwise hashing is a stan- 
dard technique for efficiently estimating similarities, 
by hashing each set (or equivalently binary vector) in 
a dictionary of size $|\Omega| = 2^{64}$ to about k hashed values
(k = 200 to 500 is common). The standard industry 
practice is to store each hashed value using 64 bits. 
The recently developed b-bit minwise hashing stores 
only the lowest b bits with small b. b-Bit minwise 
hashing is successful in applications which care about 
pairs of high similarities (e.g., duplicate detection). 

However, many applications involve computing all 
pairwise similarities (and most of the pairs are not
similar). Furthermore, some applications really care 
about containment (e.g., the fraction that one object 
is contained by another) instead of resemblance. In- 
terestingly, the current standard methods for minwise 
hashing and b-bit minwise hashing perform poorly for 
cases of low resemblance and high containment. 

Our contributions in this paper include the statisti- 
cally optimal estimator for standard minwise hashing 
and several new estimators for b-bit minwise hash- 
ing. For important scenarios (e.g., low resemblance 
and high containment), improvements of about an or- 
der of magnitude can be obtained. The full MLE solu- 
tion for b-bit minwise hashing involves a contingency 
table of $2^b \times 2^b$ cells, which can be prohibitive if b is
large. Our analysis suggests that if we only use 3 cells, 
i.e., the sum of the diagonals and the two sums of the 
off-diagonals, we can still achieve significant improve- 
ments compared to the current practice. 



A Proof of Lemma 3

Consider two sets $S_1, S_2 \subseteq \Omega = \{0, 1, 2, \ldots, D-1\}$.
Apply a random permutation $\pi : \Omega \to \Omega$ on $S_1$,
$S_2$, and store the two minimums: $z_1 = \min(\pi(S_1))$,
$z_2 = \min(\pi(S_2))$. Assuming $D \to \infty$, [22] provided
two basic probability formulas:

\[
\Pr(z_1 = i,\ z_2 = j,\ i < j)
= r_2 (r_1 - s)\,[1 - r_2]^{j-i-1}\,[1 - (r_1 + r_2 - s)]^{i} \tag{27}
\]
\[
\Pr(z_1 = i,\ z_2 = j,\ i > j)
= r_1 (r_2 - s)\,[1 - r_1]^{i-j-1}\,[1 - (r_1 + r_2 - s)]^{j} \tag{28}
\]


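We will also need to derive $\Pr(z_1 = i,\ z_2 = i)$. The
exact expression is given by

\[
\begin{aligned}
\Pr(z_1 = i,\ z_2 = i)
&= \frac{\binom{D-i-1}{a-1}\binom{D-i-a}{f_1-a}\binom{D-i-f_1}{f_2-a}}
{\binom{D}{a}\binom{D-a}{f_1-a}\binom{D-f_1}{f_2-a}} \\
&= \frac{a\,(D-i-1)!\,(D - f_1 - f_2 + a)!}{D!\,(D - f_1 - f_2 + a - i)!} \\
&= \frac{a}{D} \prod_{t=0}^{i-1} \frac{D - f_1 - f_2 + a - t}{D - 1 - t}.
\end{aligned}
\]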

For convenience, we introduce the following notation:

\[
r_1 = \frac{f_1}{D}, \qquad r_2 = \frac{f_2}{D}, \qquad s = \frac{a}{D}.
\]

Also, we assume D is large (which is virtually always
satisfied in practice). We can obtain a reasonable
approximation (analogous to the Poisson approximation
of binomial):

\[
\Pr(z_1 = i,\ z_2 = i) = s\,[1 - (r_1 + r_2 - s)]^{i}.
\]

To verify this, as expected,

\[
\Pr(z_1 = z_2) = \sum_{i} \Pr(z_1 = z_2 = i)
= \sum_{i} s\,[1 - (r_1 + r_2 - s)]^{i}
= \frac{s}{r_1 + r_2 - s} = R.
\]
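The quality of this large-D approximation can be checked numerically by comparing the exact product form of $\Pr(z_1 = i, z_2 = i)$ with $s[1 - (r_1 + r_2 - s)]^i$. The Python sketch below does so for hypothetical values of D, $f_1$, $f_2$, and a (assumptions chosen only for illustration; the function names are also assumed).

```python
def exact_diag(D, f1, f2, a, i):
    """Exact Pr(z1 = i, z2 = i) = (a/D) * prod_{t<i} (D-f1-f2+a-t)/(D-1-t)."""
    p = a / D
    for t in range(i):
        p *= (D - f1 - f2 + a - t) / (D - 1 - t)
    return p

def approx_diag(D, f1, f2, a, i):
    """Large-D approximation s * [1 - (r1 + r2 - s)]^i."""
    r1, r2, s = f1 / D, f2 / D, a / D
    return s * (1 - (r1 + r2 - s)) ** i

# Hypothetical sizes with D = 10^6.
D, f1, f2, a = 10 ** 6, 300000, 200000, 100000
rel_errors = [abs(exact_diag(D, f1, f2, a, i) / approx_diag(D, f1, f2, a, i) - 1.0)
              for i in range(21)]
print(max(rel_errors))
```

The relative error grows slowly with i, but the probabilities themselves decay geometrically, so the terms that matter are approximated well when D is large.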



Now we have tools to compute $\Pr(u_{1,b} = t,\ u_{2,b} = d)$,
where $t, d \in \{0, 1, 2, 3, \ldots, 2^b - 1\}$.



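\[
\begin{aligned}
&\Pr(u_{1,b} = t,\ u_{2,b} = d,\ t < d) \\
&= \Pr(z_1 = t,\ t + 2^b,\ t + 2 \times 2^b,\ t + 3 \times 2^b,\ \ldots,\
z_2 = d,\ d + 2^b,\ d + 2 \times 2^b,\ d + 3 \times 2^b,\ \ldots)
\end{aligned}
\]

We first consider the outcomes with $z_1 < z_2$. For $z_1 = t$,

\[
\begin{aligned}
&\Pr(z_1 = t,\ z_2 = d,\ d + 2^b,\ d + 2 \times 2^b,\ d + 3 \times 2^b,\ \ldots,\ t < d) \\
&= \sum_{j=0}^{\infty} \Pr(z_1 = t,\ z_2 = d + j \times 2^b,\ t < d) \\
&= \sum_{j=0}^{\infty} r_2 (r_1 - s)\,[1 - r_2]^{d + j \times 2^b - t - 1}\,[1 - (r_1 + r_2 - s)]^{t} \\
&= r_2 (r_1 - s)\,\frac{[1 - r_2]^{d-t-1}}{1 - [1 - r_2]^{2^b}}\,[1 - (r_1 + r_2 - s)]^{t}.
\end{aligned}
\]

Similarly, for $z_1 = t + 2^b$,

\[
\begin{aligned}
&\Pr(z_1 = t + 2^b,\ z_2 = d + 2^b,\ d + 2 \times 2^b,\ d + 3 \times 2^b,\ \ldots,\ t < d) \\
&= \sum_{j=1}^{\infty} \Pr(z_1 = t + 2^b,\ z_2 = d + j \times 2^b,\ t < d) \\
&= r_2 (r_1 - s)\,\frac{[1 - r_2]^{d-t-1}}{1 - [1 - r_2]^{2^b}}\,[1 - (r_1 + r_2 - s)]^{t + 2^b}.
\end{aligned}
\]

Combining the results yields

\[
\begin{aligned}
&\Pr(u_{1,b} = t,\ u_{2,b} = d,\ t < d,\ z_1 < z_2) \\
&= \sum_{i=0}^{\infty} r_2 (r_1 - s)\,\frac{[1 - r_2]^{d-t-1}}{1 - [1 - r_2]^{2^b}}\,
[1 - (r_1 + r_2 - s)]^{t + i \times 2^b} \\
&= r_2 (r_1 - s)\,\frac{[1 - r_2]^{d-t-1}}{1 - [1 - r_2]^{2^b}}\;
\frac{[1 - (r_1 + r_2 - s)]^{t}}{1 - [1 - (r_1 + r_2 - s)]^{2^b}}.
\end{aligned}
\]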
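Next, we study $\Pr(u_{1,b} = t,\ u_{2,b} = d,\ t < d,\ z_1 > z_2)$.
By symmetry,

\[
\begin{aligned}
&\Pr(z_1 = t,\ t + 2^b,\ t + 2 \times 2^b,\ \ldots,\
z_2 = d,\ d + 2^b,\ d + 2 \times 2^b,\ \ldots,\ t < d,\ z_1 > z_2) \\
&= r_1 (r_2 - s)\,\frac{[1 - r_1]^{t + 2^b - d - 1}}{1 - [1 - r_1]^{2^b}}\;
\frac{[1 - (r_1 + r_2 - s)]^{d}}{1 - [1 - (r_1 + r_2 - s)]^{2^b}}.
\end{aligned}
\]

Combining the results yields $\Pr(u_{1,b} = t,\ u_{2,b} = d,\ t < d)$ as the
sum of the $z_1 < z_2$ and $z_1 > z_2$ parts. The case $t > d$ follows in
the same way with the roles of the two sets exchanged.

Now, we need to compute $\Pr(u_{1,b} = t,\ u_{2,b} = t)$:

\[
\Pr(u_{1,b} = t,\ u_{2,b} = t)
= \sum_{i} \sum_{j} \Pr(z_1 = t + i \times 2^b,\ z_2 = t + j \times 2^b),
\]

which we split according to $z_1 = z_2$, $z_1 < z_2$, and $z_1 > z_2$:

\[
\begin{aligned}
\Pr(u_{1,b} = t,\ u_{2,b} = t,\ z_1 = z_2)
&= \sum_{i} \Pr(z_1 = z_2 = t + i \times 2^b)
= \frac{s\,[1 - (r_1 + r_2 - s)]^{t}}{1 - [1 - (r_1 + r_2 - s)]^{2^b}}, \\
\Pr(u_{1,b} = t,\ u_{2,b} = t,\ z_1 < z_2)
&= r_2 (r_1 - s)\,\frac{[1 - r_2]^{2^b - 1}}{1 - [1 - r_2]^{2^b}}\;
\frac{[1 - (r_1 + r_2 - s)]^{t}}{1 - [1 - (r_1 + r_2 - s)]^{2^b}}, \\
\Pr(u_{1,b} = t,\ u_{2,b} = t,\ z_1 > z_2)
&= r_1 (r_2 - s)\,\frac{[1 - r_1]^{2^b - 1}}{1 - [1 - r_1]^{2^b}}\;
\frac{[1 - (r_1 + r_2 - s)]^{t}}{1 - [1 - (r_1 + r_2 - s)]^{2^b}}.
\end{aligned}
\]

Finally, we re-write the probabilities in terms of $R = P_=$,
$P_<$, and $P_>$ whenever possible, using
$s = R\,(r_1 + r_2 - s)$, $r_1 - s = P_<\,(r_1 + r_2 - s)$, and
$r_2 - s = P_>\,(r_1 + r_2 - s)$. For example,

\[
\begin{aligned}
&\Pr(u_{1,b} = t,\ u_{2,b} = d,\ t < d) \\
&= P_<\,\frac{r_2\,[1 - r_2]^{d-t-1}}{1 - [1 - r_2]^{2^b}}\;
\frac{[1 - (r_1 + r_2 - s)]^{t}\,(r_1 + r_2 - s)}{1 - [1 - (r_1 + r_2 - s)]^{2^b}} \\
&\quad + P_>\,\frac{r_1\,[1 - r_1]^{t + 2^b - d - 1}}{1 - [1 - r_1]^{2^b}}\;
\frac{[1 - (r_1 + r_2 - s)]^{d}\,(r_1 + r_2 - s)}{1 - [1 - (r_1 + r_2 - s)]^{2^b}},
\end{aligned}
\]

and the cases $t > d$ and $t = d$ are re-written in the same way.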